Embedding Models: from Architecture to Implementation

Welcome to Embedding Models: from Architecture to Implementation.

Built in partnership with Vectara

You may have heard of embedding vectors being used in generative AI applications. These vectors have an amazing ability to capture the meaning of a word or phrase.

Introduction to Embedding Models

In this lesson, you will learn:

Vector Embeddings

Vector embeddings map real-world entities, such as a word, sentence, or image, to vector representations: points in some vector space.

A key property is that points that lie close to each other in the vector space correspond to entities with similar semantic meaning.
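As a quick, hedged illustration of "nearby points mean similar meaning," the toy sketch below compares hand-made vectors with cosine similarity; the numbers are invented for illustration and are not real embeddings.

import numpy as np

# Toy illustration: cosine similarity as the notion of "closeness" in vector space.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = np.array([0.9, 0.1, 0.3])        # invented vectors, not real embeddings
kitten = np.array([0.85, 0.15, 0.25])
car = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(cat, kitten))  # high score: semantically close
print(cosine_similarity(cat, car))     # lower score: semantically distant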

Word Embeddings

Word2Vec was the pioneering work on learning token or word embeddings that maintain semantic meaning.

These word embedding vectors behave like vectors in a vector space, allowing algebraic operations:

queen - woman + man ≈ king

Example from Star Wars text:

Yoda - good + evil ≈ Vader
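The sketch below reproduces this kind of analogy arithmetic with gensim's pretrained Word2Vec vectors; the dataset name is gensim's public download (not something provided by the course) and fetching it requires a sizable one-time download.

import gensim.downloader as api

# Sketch: word-vector analogies with pretrained Word2Vec (Google News vectors).
wv = api.load("word2vec-google-news-300")

# queen - woman + man  ≈  king
print(wv.most_similar(positive=["queen", "man"], negative=["woman"], topn=3))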

A sentence embedding model applies the same principle to complete sentences, converting a sentence into a vector that represents its semantic meaning.
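As a hedged sketch of what a sentence embedding model does in practice, the example below uses the sentence-transformers library; the checkpoint name is an illustrative public model, not necessarily the one used in this course.

from sentence_transformers import SentenceTransformer, util

# Sketch: encode whole sentences into vectors and compare them.
model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative public checkpoint
sentences = [
    "The weather is lovely today.",
    "It is sunny and warm outside.",
    "I left my laptop at the office.",
]
embeddings = model.encode(sentences)              # shape (3, 384) for this model
print(util.cos_sim(embeddings, embeddings))       # the first two sentences score highest together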

Applications of Vector Embeddings

Key Applications:

Retrieval in RAG

A critical component of any good RAG pipeline is the retrieval engine.

How it works:

Approaches for Ranking Text Chunks:

Contextualized Token Embeddings

In this lesson, you will learn:

Problem with Word Embeddings

Word embedding models like Word2Vec and GloVe don't understand context:

"The bat flew out of the cave at night."
"He swung the bat and hit the home run."

Using these models, both instances of "bat" would have the same vector embedding despite different meanings.
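To see the difference concretely, the sketch below pulls the contextualized vector for "bat" out of a standard BERT checkpoint for both sentences; the model name and the single-token pooling are illustrative choices.

import torch
from transformers import AutoTokenizer, AutoModel

# Sketch: the same word gets different vectors from a contextual model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bat_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]              # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bat")]                             # vector at the "bat" position

v1 = bat_vector("The bat flew out of the cave at night.")
v2 = bat_vector("He swung the bat and hit the home run.")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: context changes the vector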

Transformer Architecture

In 2017, the paper "Attention Is All You Need" introduced the transformer architecture to NLP.

The transformer architecture was originally designed for translation tasks and consists of two components: an encoder and a decoder.

Encoder output vectors are the contextualized vectors we're looking for.

BERT Model

BERT is an encoder-only transformer model heavily used in sentence embedding models.

BERT Specifications:
BERT base has 12 encoder layers, a hidden size of 768, 12 attention heads, and roughly 110 million parameters.

BERT Pre-training Tasks:
Masked Language Modeling (MLM): predict tokens that were randomly masked out of the input.
Next Sentence Prediction (NSP): predict whether two sentences appeared next to each other in the original text.

Token vs. Sentence Embedding

In this lesson, you will learn:

Tokenization in NLP

NLP systems deal with tokens, which can be whole words, subwords (word pieces), or individual characters.

Each sentence is represented by a sequence of integer values corresponding to tokens.

Token Embeddings in BERT

BERT has a vocabulary of about 30,000 tokens and an embedding dimension of 768.

How Token Embeddings Work in BERT:
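A hedged sketch of inspecting this in code with Hugging Face transformers; bert-base-uncased is an illustrative checkpoint whose WordPiece vocabulary (30,522 tokens) and 768-dimensional embeddings match the numbers above.

from transformers import AutoTokenizer, AutoModel

# Sketch: a sentence becomes integer token ids, and each id indexes a row
# of BERT's token embedding matrix (vocab_size x 768).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

ids = tokenizer("Embeddings capture meaning.")["input_ids"]
print(ids)                                   # the sentence as integer token ids
print(tokenizer.convert_ids_to_tokens(ids))  # the corresponding tokens, incl. [CLS]/[SEP]

print(model.get_input_embeddings().weight.shape)  # torch.Size([30522, 768])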

Creating Sentence Embeddings

After the success of word embeddings, researchers explored creating embedding vectors for sentences.

Initial (Failed) Approaches:

These approaches failed because they didn't properly capture the semantic meaning of the entire sentence.

Dual Encoder Architecture

Real progress in sentence embeddings came with the introduction of the dual encoder architecture.

Two Possible Goals for Sentence Encoders:
1. Sentence similarity: map sentences with similar meanings to nearby points in the vector space.
2. Question answering: map a question close to the passages that answer it.

These are not the same goal. For example, for the question "What is the tallest mountain in the world?", we want to retrieve the answer "Mount Everest is the tallest," not another sentence that merely restates the question.

The dual encoder architecture has two separate encoders (question encoder and answer encoder) and is trained using a contrastive loss.
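A deliberately tiny sketch of the idea, with plain linear layers standing in for the two BERT-based encoders and random tensors standing in for tokenized text; it only shows the shape of the computation, not a real model.

import torch
import torch.nn as nn

# Toy dual encoder: two *separate* encoders map questions and answers into the
# same vector space; a dot product scores every question against every answer.
torch.manual_seed(0)
dim_in, dim_emb, batch = 32, 16, 4

question_encoder = nn.Linear(dim_in, dim_emb)   # stand-in for a BERT question encoder
answer_encoder = nn.Linear(dim_in, dim_emb)     # stand-in for a separate answer encoder

questions = torch.randn(batch, dim_in)          # stand-ins for tokenized questions
answers = torch.randn(batch, dim_in)            # stand-ins for tokenized answers

q_emb = question_encoder(questions)             # (batch, dim_emb)
a_emb = answer_encoder(answers)                 # (batch, dim_emb)
print((q_emb @ a_emb.T).shape)                  # (batch, batch) similarity matrix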

Training a Dual Encoder

In this lesson, you will learn:

Dual Encoder Architecture:

Contrastive Loss

The idea behind contrastive loss is to pull matching pairs together in embedding space and push non-matching pairs apart, so that similar pairs receive high similarity scores and dissimilar pairs receive low ones.

In our context: each question embedding should be closer to the embedding of its matching answer than to the embeddings of every other answer in the batch.

In PyTorch, we can implement this with cross-entropy loss and a simple trick: set the target for row i of the batch similarity matrix to i (zero, one, two, and so on), indicating that the correct answer for each question is the one paired with it, i.e., the diagonal of the matrix.
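A minimal sketch of that trick, with random tensors standing in for a batch of question and answer embeddings:

import torch
import torch.nn.functional as F

# sim[i, j] = similarity of question i with answer j; the correct answer for
# question i is answer i, so the target for row i is simply i (the diagonal).
q_emb = torch.randn(8, 128)              # stand-in question embeddings (batch=8)
a_emb = torch.randn(8, 128)              # stand-in answer embeddings for the same batch

sim = q_emb @ a_emb.T                    # (8, 8) similarity matrix
targets = torch.arange(sim.size(0))      # tensor([0, 1, ..., 7])
loss = F.cross_entropy(sim, targets)     # rewards high scores on the diagonal
print(loss.item())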

Building the Encoder

Encoder Components:

The final output is a contextualized embedding that can be used for similarity comparisons.
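One way such an encoder might be put together is sketched below: a BERT backbone, [CLS]-token pooling, and a projection layer. The pooling choice, projection size, and checkpoint name are assumptions for illustration, not necessarily the course's exact design.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class Encoder(nn.Module):
    """Sketch of a BERT-based encoder: backbone + [CLS] pooling + projection."""

    def __init__(self, model_name: str = "bert-base-uncased", emb_dim: int = 512):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.backbone.config.hidden_size, emb_dim)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]     # contextualized [CLS] vector
        return self.projection(cls)           # final sentence embedding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = Encoder()
batch = tokenizer(["What is the tallest mountain?"], return_tensors="pt")
with torch.no_grad():
    print(encoder(batch["input_ids"], batch["attention_mask"]).shape)  # (1, 512)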

Training Loop

Training Process:
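A condensed sketch of how the pieces above fit together in a training loop; the Encoder class from the previous sketch, the dataloader of tokenized question/answer batches, and the hyperparameters are all assumptions for illustration.

import torch
import torch.nn.functional as F

# Sketch of a dual-encoder training loop (assumes the Encoder sketch above and a
# dataloader yielding tokenized question/answer batches).
question_encoder = Encoder()
answer_encoder = Encoder()
optimizer = torch.optim.Adam(
    list(question_encoder.parameters()) + list(answer_encoder.parameters()), lr=1e-5
)

for q_batch, a_batch in dataloader:          # hypothetical dataloader of (question, answer) pairs
    q_emb = question_encoder(q_batch["input_ids"], q_batch["attention_mask"])
    a_emb = answer_encoder(a_batch["input_ids"], a_batch["attention_mask"])
    sim = q_emb @ a_emb.T                    # in-batch (batch, batch) similarity matrix
    targets = torch.arange(sim.size(0))      # correct answers sit on the diagonal
    loss = F.cross_entropy(sim, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()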

Using Embeddings in RAG

In this lesson, you will learn:

RAG Pipeline with Dual Encoder:

Approximate Nearest Neighbors

Finding matching chunks by computing the similarity between the question embedding and every answer embedding is computationally expensive when the corpus is large.

Instead, we use Approximate Nearest Neighbors (ANN) algorithms:

These algorithms approximate nearest neighbor searches with high accuracy but significantly lower compute time.
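As one concrete, hedged example, the sketch below builds an HNSW index with the FAISS library over random stand-in answer embeddings; FAISS and the HNSW parameters are illustrative choices rather than the course's specific tooling.

import numpy as np
import faiss

# Sketch: approximate nearest-neighbor search over answer embeddings with FAISS.
dim = 768
answer_embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in corpus vectors

index = faiss.IndexHNSWFlat(dim, 32)      # HNSW graph index, 32 links per node
index.add(answer_embeddings)              # build the index once, offline

question_embedding = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(question_embedding, 5)   # top-5 approximate matches
print(ids)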

For large datasets, implement ANN using a persistent data store on disk.

Full RAG Pipeline

RAG Implementation Options:

Full Pipeline Flow:
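The sketch below strings the stages together; question_encoder, index, chunks, and llm_generate are hypothetical stand-ins for the components discussed above, not a specific library's API.

# Hedged end-to-end sketch of the RAG flow. All names below (question_encoder,
# index, chunks, llm_generate) are hypothetical stand-ins, not real APIs.
def answer_question(question: str, k: int = 5) -> str:
    q_emb = question_encoder.encode(question)        # 1. embed the user question
    _, ids = index.search(q_emb.reshape(1, -1), k)   # 2. ANN lookup of the top-k chunks
    retrieved = [chunks[i] for i in ids[0]]          # 3. fetch the matching text chunks
    prompt = (
        "Answer the question using only the facts below.\n\n"
        + "\n".join(retrieved)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return llm_generate(prompt)                      # 4. let the LLM compose the answer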

Conclusion

In this course, you learned about:

Two-Stage Retrieval Pipeline

A common practical approach is two-stage retrieval (Retrieve and Rerank): a fast first stage (for example, a dual encoder with an ANN index) retrieves a set of candidate chunks, and a slower but more accurate second stage reranks those candidates before they are passed to the LLM.
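A hedged sketch of the second (rerank) stage using a cross-encoder from sentence-transformers; the checkpoint name is an illustrative public model, and the candidate chunks stand in for whatever the first-stage retriever returned.

from sentence_transformers import CrossEncoder

# Sketch: rerank first-stage candidates with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative checkpoint

query = "What is the tallest mountain in the world?"
candidates = [  # e.g., top chunks returned by the ANN retriever
    "Mount Everest is the tallest mountain above sea level.",
    "K2 is the second-highest mountain on Earth.",
    "Mountains form through tectonic plate collisions.",
]
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])   # the most relevant chunk after reranking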

Additional Retrieval Techniques

While embedding models are essential for RAG, other retrieval techniques can complement neural search:

These techniques help ensure that the facts passed to the LLM are the most appropriate for responding to the user query.

Thank you for joining us to learn about sentence embeddings!
